feat(expr-ir): Support over(*partition_by) #3224
base: oh-nodes
Conversation
`expected` is now taken from testing the same selector on `main`
Adopting what polars does is simpler than special-casing
…d-over-and-over-again
Aligning this is not important
Adding `parse_into_selector_ir` will require calling this a lot. I'd rather skip using `re` when a more performant option is available.
Still have some translations missing. `by_index` will mean updating `matches_column` to *also* pass in the schema index.
Supports selector input for partitions
- Already works, but I want to add some optimizations for the single-partition case
- `pc.unique` can be used directly on a lot of `ChunkedArray` types, but `filter` will drop nulls by default, so it needs some care if they are present
Avoids the need for a temporary composite-key column by using `dictionary_encode` and generating boolean masks based on index position.
Left a comment in `selectors` about this issue earlier
```python
for idx in range(len(arr_dict.dictionary)):
    # NOTE: Acero filter doesn't support `null_selection_behavior="emit_null"`
    # Is there any reasonable way to do this in Acero?
    yield native.filter(pc.equal(pa.scalar(idx), indices))
```
Is this for use in `over(partition_by=...)`? If so, just as a heads up, we won't be able to accept a solution which involves looping over partitions in Python.
Oh hi Marco, fancy seeing you here 😄
> is this for use in `over(partition_by=...)`?

No, this part is just for `DataFrame.partition_by("one_column")`.
The multi-column variant of that is run in the C++ engine.
For `over(partition_by=...)` I need the partitions put back together, and only for a single column rather than a full table here. These are different enough that I'd more likely use `union` (41d8cc2), and keeping things threaded will probably make more sense.

I'm also thinking ahead to `over(*partition_by, order_by=...)`, which will insert one of these nodes into the plan.
Edit: I've just updated the description to try and be clearer that these are related problems - but only in a conceptual sense
But to clarify:

> we won't be able to accept a solution which involves looping over partitions in python

This isn't looping over partitions; it is looping over an index into the partition key. The index is being used to create a mask from `indices`, which AFAICT isn't expensive. At the very least, I'm expecting this to be cheaper than what `ArrowGroupBy.__iter__` currently does, which involves casting and concatenating columns before filtering on the resulting keys.
But the note

> Is there any reasonable way to do this in Acero?

is me leaving myself a TODO to try and benchmark that later 😅
Related issues

- Expr IR #2572
- `order_by`, `hashjoin`, `DataFrame.{filter,join}`, `Expr.is_{first,last}_distinct` #3173

Notes

- `pc.unique` preserving order

Tasks

- [ ] `by_dtype` naive hash
- [ ] `by_dtype` handles bare parametric types
- [ ] `DataFrame.partition_by` (`pyarrow`) to support more kinds of partitions; those needed for `over(*partition_by)` are a superset of what's needed for it, and `over` is reduced to adapting vector functions to operate on those partitions
- [ ] Reuse/adapt concept from `DataFrame.partition_by` / `GroupBy.__iter__` for `over(*partition_by)` (non-aggregating), via `union` or [`pyarrow.concat_tables`]
- [ ] Reuse/adapt concept for `GroupBy.agg(<BinaryExpr>)`